12 - Knowledge Discovery in Databases

Okay, welcome to our last lecture of the semester.

Before we start, as always, are there any leftover questions from last week?

Okay, if not, we will restart with the parametric statistical outlier detection methods, with the Grubbs test. I sped through that last time and decided to repeat it this time with a bit more time. As said last time, the Grubbs test is a statistical test of whether outliers exist in a univariate data set under a specific assumption, namely that the data are generated by a normal distribution. For this, we have a null hypothesis, specifically that the data set contains no outliers. To reject or retain this null hypothesis, we now perform the Grubbs test.

Therefore, we first compute the so-called Grubbs statistic: G is the maximum, over all data points x_i, of the absolute difference between x_i and the sample mean, divided by the sample standard deviation. Note that both the mean and the standard deviation here are the sample versions, computed from the data set itself.

And now this value G is compared to a boundary, I would call it, where we compute the number of values minus one, divided by the square root of N, times the square root of a more complicated formula, at least a more complicated-looking formula. However, it's pretty simple if you know that this t² in here and this t² in here are just the squared values you take out of a t-distribution table at a specific significance level. As always, you can set the significance level depending on your needs. It might be 0.05, it might be 0.005, and so on. This significance level has to be set by you, and then you can just look the value up in a table. And then you divide the squared value you looked up by the number of data points minus two plus the squared value you looked up. And once you have computed both G and this more complicated-looking boundary, you can compare the two. If G is higher than this critical value, I would call it, then you reject the null hypothesis. And since the null hypothesis was that the data set contains no outliers, rejecting it means that there are outliers.
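To make this concrete, here is the test written out, in the standard form found in the statistics literature: for N data points with sample mean x̄ and sample standard deviation s,

```latex
G = \frac{\max_{i=1,\dots,N} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\text{reject } H_0 \text{ if }
G > \frac{N-1}{\sqrt{N}}
    \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}
```

where t_{α/(2N), N−2} is exactly the value you look up in the t-distribution table, at significance level α/(2N) with N−2 degrees of freedom. And here is a minimal sketch of the whole test in Python, assuming scipy is available; the function name and the toy data are illustrative, not from the lecture:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs test; returns (G, critical value, reject H0?)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)   # Grubbs statistic G
    # squared t value at significance alpha/(2n) with n-2 degrees of freedom
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    critical = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return g, critical, g > critical

# toy data: the value 9.7 should be flagged as an outlier
g, crit, reject = grubbs_test([2.1, 2.3, 1.9, 2.2, 2.0, 9.7])
print(f"G = {g:.3f}, critical = {crit:.3f}, outliers present: {reject}")
```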

Of course, this can only be done in a univariate data set. You might also want to detect outliers in a multivariate data set, so a data set containing two or more attributes, not only one attribute. This can be done by extending univariate outlier detection methods like the Grubbs test, or by using specifically designed multivariate outlier detection methods. Two methods we will talk about right now are the Mahalanobis distance and the chi-squared statistic. Chi-squared is something you already should know. Let's go through these two, starting with the Mahalanobis distance.

With the Mahalanobis distance, we measure the distance of an actual object x to the data set's distribution. So the more distance there is between a specific data point and the rest of the data set, or rather the distribution of the data set, the more likely it is to be an outlier.

And basically that's what's done. We take a data point x, again multivariate, so there are multiple attribute values in it, and subtract from it the mean vector of our multivariate data set, then transpose that. Why do we need to transpose it? Because we have multiple variates, so there are multiple values in x. Each one is one variate, one attribute, and after subtracting the mean of that specific attribute from each value, we end up with a vector. Say we have a data set with the attributes a, b, c: we calculate this difference for each of them, and the resulting vector then has to be transposed for the calculation to work. Then we take the covariance matrix, which is something you should already know from the preprocessing lecture, if I'm correct; more precisely, we multiply by the inverse of the covariance matrix, and then again by the non-transposed difference of value minus mean.

And once we have this value, we can also feed it into the Grubbs test, to either reject our hypothesis that we have no outliers, or keep it. Again, the same basic principle: we go with the null hypothesis that we do not have any outliers.
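A minimal sketch of that in Python, assuming numpy; the function name and the toy data are mine, and the last comment only hints at handing the resulting distances to the grubbs_test sketch from above:

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every object in X to the data distribution."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)                          # (x - mean) per object
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix
    # (x - mean)^T  Sigma^{-1}  (x - mean), evaluated row by row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1], [5.0, 9.0]])
dists = mahalanobis_distances(X)
print(dists)   # the last object stands out
# the now-univariate distances can then be checked with grubbs_test(dists)
```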

The second option is to check the chi-squared statistic, again something you should already know from earlier lectures, where we try to find multivariate outliers by calculating, for each object, a sum over all of its attributes. So if our objects have n attribute values, we sum up n terms: for each attribute, the squared difference between the observed value and the expected value of that attribute, divided by the expected value. Objects with a large chi-squared value are then considered likely outliers.
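In symbols, that is χ² = Σᵢ (xᵢ − Eᵢ)² / Eᵢ, summed over the n attributes, where Eᵢ is the expected value of attribute i, commonly taken to be the mean of attribute i over all objects. A minimal sketch under that assumption, with illustrative names and toy data:

```python
import numpy as np

def chi_squared_scores(X):
    """Chi-squared outlier score per object: sum_i (x_i - E_i)^2 / E_i."""
    X = np.asarray(X, dtype=float)
    expected = X.mean(axis=0)   # E_i: mean of attribute i over all objects
    return ((X - expected) ** 2 / expected).sum(axis=1)

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, 9.0]])
print(chi_squared_scores(X))    # the last object gets the largest score
```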
